We present a fine grained, massively parallel SIMD architecture,
called the data structure accelerator, and demonstrate its use in
a number of problems in computational geometry. This architecture
is extremely dense and highly scalable. Systems of 10^6 processing
elements can be feasibly embedded in workstations. We advocate that
this architecture be used in tandem with conventional, single
sequence machines and with small scale, shared memory
multiprocessors. A language for programming such heterogeneous
systems is presented that smoothly incorporates the SIMD
instructions of the data structure accelerator with conventional
single sequence code.


1  Introduction 


There has been a significant body of work on single
instruction, multiple data (SIMD) computer architectures
in the past. This work ranges from the MPP machine
developed at Goodyear [4] to the Connection Machine [3]
and the MasPar MP-1 today. These machines are generally
viewed (and sometimes even advertised) as large
“supercomputers.” They are intended to be used for large
problems that are not practical on smaller machines.
However, these SIMD machines are not uniformly better than
conventional processing elements or the new generation of
MIMD processors for all problems. It is rarely cost
effective to couple the large SIMD machines with other
types of processing elements so that a combined,
heterogeneous machine can be applied to a problem, with
each component performing those computations at which it
is best. Furthermore,
existing  SIMD  machines  have  a  relatively  modest  num¬ 
ber  of  processing  elements  (the  Connection  Machine 
can handle up to about 10®-8).

The chunks of computation performed on the current
generation SIMD machines tend to be quite large.
There is relatively little cooperation between the SIMD
machine and the host machines during the computation.
This reinforces the division between SIMD computations
and more conventional approaches.

We  feel  this  division  is  counterproductive.  To  illus¬ 
trate  this  a  few  algorithms  are  presented  in  Section  4 
that  use  a  mixture  of  SIMD  and  SISD  constructs  to 
achieve  high  performance.  Interestingly,  these  algo¬ 
rithms  are  quite  simple  compared  to  the  optimal,  single 
sequence  algorithms  for  the  same  problem,  and  yet  the 
SIMD  algorithms  perform  substantially  better.  The 
problems  these  techniques  solve  arise  as  parts  of  much 
larger  problems,  such  as  mechanical  simulation,  that 
are  quite  difficult  to  parallelize  using  SIMD  techniques. 
It would be wildly impractical to devote a Connection
Machine to their resolution in most cases.

In  this  paper  we  suggest  that  the  real  role  for  SIMD 
architectures  is  not  as  “stand-alone  supercomputers," 
but  as  integral  components  of  heterogeneous  machines 
that  consist  of  both  SIMD  and  SISD  (or  MIMD) 
components— each  component  responsible  for  the  por¬ 
tions  of  a  computation  at  which  they  are  most  effec¬ 
tive.  This  often  means  managing  large,  memory  resi¬ 
dent  data  structures,  or  performing  simple  operations 
on  large  blocks  of  data.  Thus,  a  natural  way  to  merge 
a  SIMD  architecture  with  a  conventional  one  is  to  inte¬ 
grate  the  SIMD  processing  elements  into  the  memory 
system. 

We argue for simple SIMD architectures that can
be built relatively cheaply and with very high density.
We are interested in developing systems with upwards of
10^6 processing elements that can be used in personal
workstations and upwards of 10^8 for “supercomputing
applications.” The individual elements of such SIMD
architectures must be quite simple to have this type of
density, and their interconnect must also be simple to
allow for scalability to the size of system we are
interested in. We have developed a class of SIMD
architectures that meets these criteria, which we call
Data Structure Accelerators (DSA).1 In Section 2 we
present their organization in detail.

One of the problems with heterogeneous architectures
is the difficulty of expressing algorithms clearly and
succinctly. In Section 3 we describe a few natural
extensions of a hypothetical C-like language that simplify
the description of data parallel computations. This
extended language is unique in that it allows one to
express cooperative algorithms that are executed on a
heterogeneous computer consisting of both a SIMD component
and a conventional single sequence component.

Having  SIMD  computation  cheaply  available  affects 
the  type  of  algorithms  that  are  used.  Many  of  the  com¬ 
plex  algorithms  developed  for  searching  and  managing 
data  structures  are  no  longer  necessary  because  the 
data  structures  can  be  managed  by  the  DSA  and  all 
elements  handled  in  parallel.  This  is  demonstrated  in 
Section 4 where we discuss a few problems in
computational geometry.

2 The Data Structure Accelerator Architecture

The  Data  Structure  Accelerator  (DSA)  is  a  class  of 
SIMD  architectures  that  is  extremely  dense,  easily  scal¬ 
able  and  that  has  been  optimized  to  efficiently  perform 
functions that are difficult for conventional processing
systems. The DSA's processing elements (PE's) are
sufficiently compact that machines with upwards of 10^6
processing elements are feasible, well into the massively
parallel regime.

The processing elements are connected in a low
dimensional, rectangular grid. This type of interconnect is
much  cheaper,  and  can  be  scaled  upwards  much  better 
than  the  boolean  n-cube  network  used  by  machines  like 
the  Connection  Machine.  Although  using  this  simple 
interconnect  scheme  makes  a  few  problems  impractical, 
we  felt  that  the  improved  scale  and  cost  of  the  resulting 
system  more  than  compensated.  For  many  applications 
a  one  dimensional  interconnect  is  sufficient.  For  some 
problems  in  vision  and  computational  geometry  two  di¬ 
mensional  interconnects  are  useful.  In  principle,  higher 
dimensional  interconnects  could  be  used,  but  their  im¬ 
pact  on  pin  count  and  PE/chip  density  is  severe.  For 
now  we  are  only  considering  one  and  two  dimensional 
data  structure  accelerators. 

To keep the processing elements small, we have decided
to restrict them to single bit width. This minimizes the
number of operations that need to be incorporated in each
PE, but they are sufficiently powerful for most
applications. Because the DSA is often used to manage
large tables, we have included content addressable memory
in the DSA architecture. An example of such an element
used in a one dimensional linear array is shown in
Figure 1. Each processing element is called a line. A line
contains some amount of content addressable memory (CAM)
and random access memory (RAM). A particular data
structure accelerator is characterized by four parameters:
the number of lines in the DSA (f), the dimensionality of
their interconnect, the number of bits of CAM (m) and the
number of bits of RAM (n). The Select Word, which controls
which PE's participate in an operation, is shared by all
the lines of the DSA.

1 Earlier versions of this work at MIT referred to this
architecture as a Database Accelerator. Since this effort is
directed towards “in memory” databases our original choice of
names was somewhat misleading. To correct this we have chosen
the name data structure accelerator, which is more suggestive.

Generally,  the  dimensionality  of  the  DSA  is  under¬ 
stood  (it  is  usually  one)  and  the  number  of  lines  is  not 
important  to  the  algorithms.  However,  the  size  of  the 
CAM  and  RAM  structures  is  important.  Thus,  we  say 
that a data structure accelerator has parameters (m, n)
when  each  line  has  m  bits  of  CAM  and  n  bits  of  RAM. 
The  amount  of  CAM  and  RAM  associated  with  each 
line  of  a  DSA  can  be  optimized  for  different  applica¬ 
tions  and  can  range  from  DSA  systems  with  all  CAM 
and no RAM to all RAM and no CAM. The Smart
Memories Project at MIT has built a 64 line (32,4)
DSA chip [6, 8], a 256 line (32,4) DSA chip [1] and a
board that implements a 4096 line (32,4) DSA. With
current technology, devices with thousands of lines are
not impractical and systems in the range of 10^6 to 10^8
PE's are feasible.

The  data  structure  accelerator  uses  a  three  valued 
logic  for  two  purposes:  to  indicate  which  elements  of 
an  array  execute  particular  instructions  and  within  the 
CAM  cells,  to  increase  their  flexibility.  One  unit  of 
this three valued system is called a trit. Each trit can
assume one of the values 0, 1 or X. The only binary
operation we perform with trits is equivalence, which
obeys the following “truth” table.


     ~ |  0   1   X
    ---+------------
     0 |  1   0   1
     1 |  0   1   1
     X |  1   1   1
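The equivalence table can be restated as a small function. The following is a minimal Python sketch, not part of the DSA description language; the names `trit_equiv` and `word_matches` and the use of the characters "0", "1" and "X" to stand for trits are assumptions made for illustration.

```python
# Trits are modeled as the characters "0", "1" and "X" (an assumption
# of this sketch; the hardware encoding is not specified here).

def trit_equiv(a, b):
    """Return 1 when two trits are equivalent; X is equivalent to anything."""
    return 1 if a == "X" or b == "X" or a == b else 0


def word_matches(pattern, word):
    """Two trit strings match when every position is equivalent."""
    return len(pattern) == len(word) and all(
        trit_equiv(p, w) for p, w in zip(pattern, word)
    )


# The full table, indexed by (row, column) as in the text.
TABLE = {(a, b): trit_equiv(a, b) for a in "01X" for b in "01X"}
```

Applying the same test position by position is how a whole trit string, such as a select word or a CAM pattern, is compared against a binary word.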

Figure 1: MIT Database Accelerator Architecture

The DSA architecture is capable of performing five
basic instructions: select, write, match, operate
and readout. The select instruction specifies which
processing elements participate in the next sequence
of instructions by writing its operand, which is a trit
string, into the Select Word of the DSA. Until the next
select instruction, only lines whose address matches
the contents of the select word perform any DSA
operation. Thus to have all lines active, the select word
is filled with X's. To have the even lines participate,
XX...X0 is used, and if lines 3, 11, 19 and 27 are to
participate, the select word will contain 0...0XX011.
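The select word examples can be checked by enumerating addresses. This is a hedged Python sketch: `selected_lines` is a hypothetical helper, and it assumes that line addresses are compared most significant bit first against a select word of the same width.

```python
def selected_lines(select_word):
    """List the line addresses whose binary form matches the select word.

    An "X" in the select word matches either address bit.
    """
    width = len(select_word)
    lines = []
    for address in range(2 ** width):
        bits = format(address, "0{}b".format(width))
        if all(s == "X" or s == b for s, b in zip(select_word, bits)):
            lines.append(address)
    return lines
```

With an 8 bit address, `selected_lines("000XX011")` yields lines 3, 11, 19 and 27, and a word ending in 0 with X's elsewhere yields the even lines.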

The  other  four  instructions  are  executed  in  parallel 
by  each  of  the  selected  processing  elements  in  a  SIMD 
fashion. The write instruction writes its operand into
the  CAM  word  of  the  selected  processing  element(s). 
Since  the  Select  Word  can  contain  X’s  a  write  instruc¬ 
tion  can  cause  the  CAM  of  more  than  one  PE  to  be 
modified. 

The  match  instruction  includes  a  data  word  that  is 
matched  against  the  contents  of  the  CAM  words  of 
each of the selected PE's using the equivalence function
given above. The result of the match is then written into
the MatchLatch, where it can be used by the
operate  instruction.  No  data  is  transmitted  out  of  the 
DSA  by  a  match  instruction. 

The  operate  instruction  causes  each  selected  PE  to 
perform  a  boolean  operation  on  the  contents  of  two 
registers  and  store  the  result  in  a  third.  This  is  a  three 
operand instruction. As shown in Figure 1, one of the
operands can come from the MatchLatch or a register
of an adjacent PE. The result of the boolean operation
is also latched by the priority encoder.

The  contents  of  the  priority  encoder  are  read  using 
the  readout  instruction.  The  readout  instruction  re¬ 
turns  the  address  of  one  of  the  lines  whose  priority  en¬ 
coder  latch  is  set.  At  the  same  time  it  clears  that  par¬ 
ticular  priority  encoder  latch.  Consequently,  successive 
readout  instructions  return  the  addresses  of  the  lines 
whose priority encoders contain a one. A special code
is returned if all priority encoder latches contain zeroes.
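To make the interplay of the five instructions concrete, here is a toy software model of a one dimensional DSA. It is only a sketch under several assumptions that the text does not fix: readout returns the lowest set address, the "special code" is modeled as -1, readout ignores the select word, and neighbor registers are omitted. The class and method names are illustrative.

```python
class DSA:
    """Toy one dimensional DSA: m trits of CAM and n bits of RAM per line."""

    NO_MATCH = -1  # assumed stand-in for the "special code"

    def __init__(self, lines, m, n):
        self.addr_bits = max(1, (lines - 1).bit_length())
        self.cam = ["X" * m for _ in range(lines)]
        self.ram = [[0] * n for _ in range(lines)]
        self.match_latch = [0] * lines
        self.priority = [0] * lines
        self.select_word = "X" * self.addr_bits

    @staticmethod
    def _equiv(pattern, word):
        return all(p == "X" or w == "X" or p == w for p, w in zip(pattern, word))

    def _active(self):
        # Lines whose address matches the current select word.
        for i in range(len(self.cam)):
            address = format(i, "0{}b".format(self.addr_bits))
            if self._equiv(self.select_word, address):
                yield i

    def _read(self, i, src):
        # A source is either a RAM bit index or the string "match".
        return self.match_latch[i] if src == "match" else self.ram[i][src]

    def select(self, word):
        self.select_word = word

    def write(self, word):
        for i in self._active():
            self.cam[i] = word

    def match(self, word):
        for i in self._active():
            self.match_latch[i] = int(self._equiv(word, self.cam[i]))

    def operate(self, fn, src_a, src_b, dst):
        # Three operand boolean operation; the result is also latched
        # by the priority encoder.
        for i in self._active():
            result = fn(self._read(i, src_a), self._read(i, src_b)) & 1
            self.ram[i][dst] = result
            self.priority[i] = result

    def readout(self):
        for i, bit in enumerate(self.priority):
            if bit:
                self.priority[i] = 0
                return i
        return self.NO_MATCH


def demo():
    dsa = DSA(lines=8, m=4, n=2)
    dsa.select("XXX")
    dsa.write("0000")
    dsa.select("X1X")          # lines 2, 3, 6 and 7
    dsa.write("1010")
    dsa.select("XXX")
    dsa.match("1X10")
    dsa.operate(lambda a, b: a | b, "match", "match", 0)
    hits = []
    line = dsa.readout()
    while line != DSA.NO_MATCH:
        hits.append(line)
        line = dsa.readout()
    return hits
```

In `demo`, a single write with an X in the select word updates four lines at once, and the matching lines are then drained one address per readout.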

Even though a particular set of parameters must be
chosen when building a DSA system, the user can simulate
a DSA with a different set of parameters at surprisingly
small cost, as shown in the following table. Each
entry indicates the cost of simulating the particular
operation on a DSA with parameters (km, ln). This ability
to emulate DSA's with different parameters is one of
the most important differences between the DSA and
previous CAM designs.


[Table: cost of simulating each operation of a (km, ln) DSA.
Recoverable column headings: Operation, (km, ln), (m, n), (0, ln).
Recoverable entries: 2k - 1 and 2km - 1.]
3 An Algebraic Language for Specifying DSA Operations

We  do  not  believe  the  DSA  should  be  viewed  as  a  uni¬ 
versal  computing  engine  but  rather  as  a  component  of 
a  heterogeneous  computing  system,  where  the  DSA  is 
used as a slave of some host processor or perhaps shared
among several processors. The host, or its delegate,
is  responsible  for  sequencing  the  DSA  instructions  and 
performing  those  data  operations  for  which  a  SISD  or 
MIMD  machine  is  preferable. 

Thus  algorithms  that  utilize  the  DSA  are  a  mixture 
of  DSA  instructions  and  conventional  single  sequence 
processor  instructions.  Rather  than  expressing  these 
algorithms  in  a  mixture  of  low  level  DSA  instructions 
and  some  high  level  language  for  the  single  sequence 
portion  of  the  instruction  stream,  we  have  developed 
a set of high level extensions to an algebraic programming
language in which all the DSA operations can
be  used  effectively.  This  approach  allows  us  to  in¬ 
tertwine  DSA  operations  with  more  conventional  pro¬ 
gramming  mechanisms  including  multiprocessing  ex¬ 
tensions.  This  section  does  not  give  a  complete  descrip¬ 
tion  of  the  language,  which  is  still  evolving,  but  rather 
describes  the  major  features  and  provides  enough  in¬ 
formation  to  make  the  examples  in  the  later  section 
clear. 

The DSA description language has five basic components.

•  Declarations  that  describe  the  allocation  of  DSA 
lines  to  different  DSA  arrays,  and  the  allocation 
of  CAM  and  RAM  of  the  lines  of  a  DSA  array  to 
various  tasks. 

•  Basic  operations  for  comparing  the  CAM  contents 
with  fixed  data  and  performing  boolean  and  arith¬ 
metic  operations  with  the  contents  of  the  RAM. 

•  Loop  abstractions  that  cause  operations  to  be  per¬ 
formed  on  blocks  of  DSA  lines. 

•  A  mechanism  for  describing  algorithms  best  ex¬ 
pressed  as  state  transition  tables. 

•  A  library  of  higher  level  functions. 

Each of these components is described in one of the
following subsections. It is important to notice that
our language intersperses DSA operations and conventional
SISD operations. We believe this allows our description
language to be more expressive, and properly leaves to
the compiler the problem of separating the operations
that are performed on the DSA from those performed on
the host processor.

3.1  Declarations 

The N processing elements in a DSA are identified
by their coordinates within their interconnection grid.
These coordinates are used as a subscript. For one
dimensional DSA's this is just an integer from 0 to N - 1.
Higher dimensional arrays use vector subscripts.

For instance, the RAM of the line with subscript i is
written as R_i, while its CAM is written as K_i. In the
case of a one dimensional DSA, i is an integer. In the two
dimensional case we use R_{i,j} and K_{i,j}. The individual
bits of the RAM can be referenced by R_i[0], ..., R_i[p].
The R_i, K_i and M_i registers are not normally available
to the programmer. Instead, the programmer uses variable
declarations to indicate the resource requirements of his
or her algorithm, and the compiler allocates them from the
available resources. This insulates the programmer from
the complications of emulating resources with what is
actually available.

Collections of DSA lines are called DSA arrays.
Individual DSA arrays are allocated as if they were arrays,
but their elements are declared using DSAstruct,
which indicates to the compiler the RAM and CAM
requirements of each line. For instance,

    DSAstruct interval {
        CAM color[5];
        RAM selected, min[16], max[16];
        States S ∈ {inside, outside, unknown};
    }

defines  the  structure  of  a  line  of  a  DSA  array.  Only 
one  CAM  variable  is  allocated,  color,  which  is  5  bits 
long. Three RAM variables are allocated, two of 16
bits  and  one  of  1  bit.  The  States  declaration  indicates 
the  allowable  states  of  each  line  when  programming 
the  DSA  using  state  transition  techniques.  The  state 
transition  techniques  and  the  States  declaration  are 
described  fully  in  Section  3.4.  A  DSA  line  with  this 
structure  will  have  at  least  5  bits  of  CAM  and  35  bits 
of  RAM. 

DSA arrays are collections of one or more primitive
blocks, where each primitive block is a set of 2^k DSA
lines on a 2^k boundary. Blocks of DSA lines are only
allocated in sizes that are a power of 2 because of the
organization of the decoders. Odd sized DSA arrays are
allocated as sets of primitive blocks. Primitive blocks
are identified by the selector word that spans their
elements. Thus 100XXX_2 represents the primitive block
that extends from lines 32 through 39, inclusive. To
indicate that the index i lies within this primitive block,
we write i ∈ 100XXX_2.

A DSA array with 100_10 elements would consist of
primitive blocks of size 64, 32 and 4. It would be
represented by the union of the identifiers for its
constituent primitive blocks. For instance, for the DSA
array defined by the statement:

    DSAstruct interval Table[100];

we would have

    Table = 100XXXXXX_2 ∪ 1010XXXXX_2 ∪ 1110000XX_2

The lines of Table each contain at least 5 bits of
CAM and 35 bits of RAM. Subscripts are used to identify
lines, so the fifth line of Table, all 40 bits of it,
is referred to as Table_5. Particular variables are
referred to by concatenating the DSA array name with the
variable name, separated by a dot, e.g. Table.min or
Table.min_i.
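The decomposition of an array into primitive blocks follows the binary expansion of its length. The Python sketch below illustrates this; `block_sizes` and `selector` are hypothetical helper names, and the base addresses passed to `selector` are simply read off the selector words quoted in the text rather than produced by any real allocator.

```python
def block_sizes(count):
    """Split a line count into power-of-two block sizes, largest first."""
    return [
        1 << k
        for k in range(count.bit_length() - 1, -1, -1)
        if count & (1 << k)
    ]


def selector(base, size, width):
    """Selector word for the primitive block of `size` lines at `base`.

    The block must be a power of two in size and start on a multiple
    of its own size; `width` is the number of address bits.
    """
    if size <= 0 or size & (size - 1) or base % size:
        raise ValueError("block must be a power of two on its own boundary")
    k = size.bit_length() - 1            # number of free (X) low bits
    fixed = format(base >> k, "0{}b".format(width - k)) if width > k else ""
    return fixed + "X" * k
```

For a 100 element array, `block_sizes(100)` gives 64, 32 and 4, and the three selector words of Table correspond to blocks based at lines 256, 320 and 448 in a 9 bit address space.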
Set intersection and complement can also be used to
describe DSA arrays. For instance, the even lines of
Table might be denoted by

    Table ∩ XXXXXXXX0_2
      = (100XXXXXX_2 ∪ 1010XXXXX_2 ∪ 1110000XX_2) ∩ XXXXXXXX0_2
      = 100XXXXX0_2 ∪ 1010XXXX0_2 ∪ 1110000X0_2

and the lines whose indices are not multiples of 4 by

    Table ∩ ¬XXXXXXX00_2
      = (100XXXXXX_2 ∪ 1010XXXXX_2 ∪ 1110000XX_2)
        ∩ (XXXXXXX1X_2 ∪ XXXXXXX01_2)
      = 100XXXX1X_2 ∪ 100XXXX01_2 ∪ 1010XXX1X_2
        ∪ 1010XXX01_2 ∪ 11100001X_2 ∪ 111000001_2

Each of these three operations can be performed
formally on the selector bit strings as follows. We consider
the union of two selector bit strings R = r_1 ... r_k and
S = s_1 ... s_k. Assume R and S differ in just one bit
position ℓ, so r_i = s_i for i ≠ ℓ. There are then three
possibilities:

\[
R \cup S =
\begin{cases}
R & \text{if } r_\ell = X \\
S & \text{if } s_\ell = X \\
r_1 \cdots r_{\ell-1} \, X \, r_{\ell+1} \cdots r_k & \text{otherwise}
\end{cases}
\]

The intersection of R and S can be performed on a bit
by bit basis. Assume r_i and s_i differ. If neither is an
X then the intersection of R and S is the empty set.
Otherwise, the intersection uses the bit which is not
equal to X.

The complement of R is a union of t bit strings,
where t is the number of 0's and 1's in R. To see this,
examine the simple case R = X000_2. We can trivially
write the complement of R as a union of 7 bit strings,
where each has one X in the same position. These strings
are listed on the left hand side of the double bars in the
table below. On the right hand side are given the three
simplified bit strings whose union is the complement of R.

    X100_2  X101_2  X110_2  X111_2  ||  X1XX_2
    X010_2  X011_2                  ||  X01X_2
    X001_2                          ||  X001_2

These rules allow us to reduce all combinations of
unions, intersections and complements of selector bit
strings to unions of bit strings, i.e., disjunctive normal
form.
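The three rules can be exercised directly on strings over 0, 1 and X. The following Python sketch is illustrative only; the function names are assumptions, `None` stands for the empty set, and `complement` uses one particular decomposition (flip one fixed bit, keep the bits before it, free the bits after it) that reproduces the X000 example.

```python
def intersect(r, s):
    """Bit by bit intersection of two selectors; None means the empty set."""
    out = []
    for a, b in zip(r, s):
        if a == "X":
            out.append(b)
        elif b == "X" or a == b:
            out.append(a)
        else:
            return None            # a 0 meets a 1: no line is in both
    return "".join(out)


def merge(r, s):
    """Union of two selectors that differ in exactly one position.

    Returns None when the single-selector rule does not apply.
    """
    diff = [i for i, (a, b) in enumerate(zip(r, s)) if a != b]
    if len(diff) != 1:
        return None
    i = diff[0]
    # In all three cases of the rule the result has X at position i.
    return r[:i] + "X" + r[i + 1:]


def complement(r):
    """Complement as a union with one selector per fixed (0 or 1) position."""
    terms = []
    for pos, bit in enumerate(r):
        if bit == "X":
            continue
        flipped = "1" if bit == "0" else "0"
        terms.append(r[:pos] + flipped + "X" * (len(r) - pos - 1))
    return terms
```

A general union or intersection of descriptors is then a list of selectors, with `intersect` applied pairwise and empty results dropped.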


3.2  Basic  Operations 

Boolean  and  arithmetic  operations  with  arbitrary  sized 
RAM  variables  can  be  implemented  in  a  bit  serial  fash¬ 
ion  using  the  function  generator.  Thus  we  permit 
RAM  variables  to  be  combined  using  any  of  the  stan¬ 
dard  boolean  and  arithmetic  operations,  provided  their 
lengths  are  compatible. 

Consider  the  following  code  sequence: 

    DSAstruct Sample {
        RAM A[3], B[2], C[2];
    } S;

    S.A_i ← S.B_i + S.C_i;

The compiler might allocate the variables A, B and C
in RAM as

    |         A_i          |     B_i       |     C_i       |
    | R_i[0] R_i[1] R_i[2] | R_i[3] R_i[4] | R_i[5] R_i[6] |

Then the statement A_i ← B_i + C_i would be expanded
into code equivalent to

    R_i[0]     ← R_i[3] ⊕ R_i[5];
    R_i[carry] ← R_i[3] ∧ R_i[5];
    R_i[1]     ← R_i[4] ⊕ R_i[6] ⊕ R_i[carry];
    R_i[carry] ← (R_i[4] ∧ R_i[6]) ∨ (R_i[4] ∧ R_i[carry])
                 ∨ (R_i[6] ∧ R_i[carry]);
    R_i[2]     ← R_i[carry];
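The bit serial expansion is a ripple carry adder executed one bit per step. As a hedged illustration, the Python sketch below applies it to a single line's RAM; it assumes the layout A = R[0..2], B = R[3..4], C = R[5..6] with bit 0 of each field least significant, and it includes the carry in the second sum bit, which a correct adder requires. The helper names are hypothetical.

```python
def bit_serial_add(ram):
    """Compute A <- B + C on one line's RAM, one bit position per step."""
    r = list(ram)
    carry = 0
    for a, b, c in ((0, 3, 5), (1, 4, 6)):
        total = r[b] ^ r[c] ^ carry
        # Majority of the three inputs gives the next carry.
        carry = (r[b] & r[c]) | (r[b] & carry) | (r[c] & carry)
        r[a] = total
    r[2] = carry                      # the final carry is the top bit of A
    return r


def pack(b_value, c_value):
    """Build a 7 bit RAM image with A cleared (illustrative helper)."""
    return [0, 0, 0,
            b_value & 1, (b_value >> 1) & 1,
            c_value & 1, (c_value >> 1) & 1]


def field_a(ram):
    """Read the 3 bit field A back as an integer."""
    return ram[0] | (ram[1] << 1) | (ram[2] << 2)
```

On the DSA every selected line executes the same steps at once, so the cost depends on the operand width and not on the number of lines.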

3.3  Selection  and  Querying 

Groups  of  instructions  are  encapsulated  in  a  loop-like 
block  structure  to  indicate  that  they  should  be  per¬ 
formed  by  a  number  of  processing  elements  in  parallel. 
An  example  of  this  form  is: 

    ForEach i, (boolean expression in i) {
        (forms involving i)
    }

The  body  forms  are  performed  for  each  i  that  satis¬ 
fies  the  boolean  predicate.  The  simplest  boolean  pred¬ 
icate  just  indicates  that  i  is  an  element  of  a  particular 
set.  For  instance,  the  following  code  segment  performs 
a calculation on the registers of the 32 even lines in the
range 0 to 63.

    ForEach i, i ∈ 0XXXXX0_2
        R_i[0] ← (R_i[0] ∧ R_i[1]) ∨ R_i[2]

This particular code segment will be expanded into a
select instruction to set the SelectWord to 0XXXXX0_2
and a few operate instructions for the body of the loop,

    select 0XXXXX0_2
    R_i[temp] ← R_i[0] ∧ R_i[1]
    R_i[0] ← R_i[temp] ∨ R_i[2]


Odd sized DSA arrays, like the 100 entry Table given
in Section 3.1, are dealt with by generating a DSA
descriptor that is a union of selector bit strings. Then
the body of the loop is repeated for each bit string.
For instance, the sequence

    ForEach i,
      i ∈ 100XXXXXX_2 ∪ 1010XXXXX_2 ∪ 1110000XX_2 {
        (forms involving i)
    }

would be treated as:

    ForEach i, i ∈ 100XXXXXX_2 {
        (forms involving i)
    }
    ForEach i, i ∈ 1010XXXXX_2 {
        (forms involving i)
    }
    ForEach i, i ∈ 1110000XX_2 {
        (forms involving i)
    }

Notice that even though a ForEach loop evaluates its
body at each line of the DSA array, the time required
for the loop is typically O(1), where the constant of
proportionality is the time required to perform the body
once. In the worst case, where the size of the DSA array
is just below a power of 2, the loop will be performed
O(log N) times for a DSA array with N lines.

Consider  the  following  chunk  of  code. 

    ForEach i, (i ∈ 0XXXXX0_2) ∧ (K_i ≡ Test) {
        R_i[1] ← R_i[1] ∧ R_i[2];
        printf("Line %d matched", i);
    }

The predicate for this loop is a bit more complex.
The body is performed for each of the even lines in the
range 0 to 63 whose CAM contents match Test. This
predicate is expanded into three instructions: a select
instruction that sets the SelectWord, a match instruction
for K_i ≡ Test, and an instruction that stores the result
of this match in a register (R_i[loop]) for later use.

The first statement in the body is a simple operate
cycle, except that it is only supposed to take effect on
those lines for which R_i[loop] = 1. This is accomplished
by conditionalizing writes in the loop on the value of
R_i[loop]. Thus the first line of the body expands into:

    R_i[1] ← (R_i[loop] ∧ R_i[1] ∧ R_i[2]) ∨ (¬R_i[loop] ∧ R_i[1]);

The final statement in the body actually expands
into a loop. First an operate instruction is issued to
store the contents of R_i[loop] in the priority encoder
register. Then a readout instruction occurs for each of
the designated lines, followed by the code for the printf
statement, which uses the value returned by the successful
readout instructions.

The set predicates available for use in a ForEach
statement  include  multibit  tests,  multiple  matches  and 
arithmetic  comparisons,  which  are  discussed  in  Sec¬ 
tion  3.5. 

3.4  State  Transitions 

A  common  use  of  the  processing  elements  of  the  DSA 
is  as  a  state  machine.  To  make  this  a  bit  easier  for  the 
programmer  and  to  make  the  resulting  programs  a  bit 
more  intelligible,  we  have  decided  to  have  the  compiler 
allocate  the  binary  patterns  for  states  and  work  out  the 
state  transition  equations.  This  is  accomplished  with 
two  new  types  of  statements. 

A  new  set  of  states  is  introduced  by  the  States  form. 

    States S ∈ {up, down, sideways};

This statement declares S to be a state identifier for
each PE. The state of any particular PE is indicated by
adding a subscript. Thus the state of the 5th processor
is S_5. If we wanted to find all the processing elements
which are in the up state, we would use the following
code segment:

    ForEach i, S_i = up
        printf("Line %d is up", i);

Occasionally,  computations  may  involve  more  than 
one  set  of  orthogonal  states.  In  this  case,  several  state 
variables  are  declared,  as  in  the  following  example. 

    States S ∈ {up, down, sideways};
    States T ∈ {red, yellow, blue};

With these declarations, each PE could be in one of
nine different states. We also say that the S-state of a
processing element is S_i and that its T-state is T_i.

States can be changed by using the NewState form.
The NewState form identifies which state variable is
to be changed and has a body consisting of a set of clauses
indicating how to change the state. For instance, we
might have

    States S ∈ {up, down, sideways};

    ForEach i, i ∈ "XX...XX" {
        NewState S {
            up:        S_i ← down;
            down:      if K_i ≡ "X...X11X"
                       then S_i ← up;
                       else S_i ← sideways;
            otherwise: S_i ← sideways;
        }
    }
Suppose the compiler chose to use registers R_i[0] and
R_i[1] to hold the state of each processing element, as
follows:

    State      R_i[0]  R_i[1]
    up           0       0
    down         0       1
    sideways     1       X

There are three different binary inputs to the truth
table for this state transition: the two state variables
R_i[0] and R_i[1] and the result of the match K_i ≡
"X...X11X", which we denote by M_i. This gives the
following Karnaugh map:


              00   01   11   10
    M_i       01   00   1X   1X
    ¬M_i      01   1X   1X   1X

To minimize the number of product terms we use
the following assignment of the X's.

              00   01   11   10
    M_i       01   00   10   11
    ¬M_i      01   10   10   11

Thus  the  original,  state  transition  code  is  equivalent 
to 

    ForEach i, i ∈ "XX...XX" {
        R_i[0] ← ((K_i ≢ "X...X11X") ∧ R_i[1]) ∨ R_i[0];
        R_i[1] ← ¬R_i[1];
    }

which,  though  somewhat  shorter  textually,  is  signifi¬ 
cantly  less  clear  than  the  state  transition  code  given 
earlier. 
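The register equations implied by the minimized Karnaugh map can be checked exhaustively against the symbolic transitions. In the Python sketch below, the equations new R[0] = R[0] ∨ (¬M ∧ R[1]) and new R[1] = ¬R[1] are our reading of the map with the X's assigned as in the second table, so they should be treated as an assumption; the function names are illustrative.

```python
from itertools import product


def decode(r0, r1):
    """Map a register pair back to a state name (sideways is 1X)."""
    if r0 == 1:
        return "sideways"
    return "down" if r1 else "up"


def next_state(state, matched):
    """The NewState clauses, written symbolically."""
    if state == "up":
        return "down"
    if state == "down":
        return "up" if matched else "sideways"
    return "sideways"


def step_registers(r0, r1, matched):
    """Boolean update read off the minimized Karnaugh map (assumed)."""
    new_r0 = r0 | ((1 - matched) & r1)
    new_r1 = 1 - r1
    return new_r0, new_r1


def equations_agree():
    """Compare both formulations on all eight input combinations."""
    return all(
        decode(*step_registers(r0, r1, m)) == next_state(decode(r0, r1), m)
        for r0, r1, m in product((0, 1), repeat=3)
    )
```

Both register updates read the old values of R[0] and R[1], so in a sequential expansion R[0] must be computed before R[1] is inverted.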

3.5  Arithmetic  Comparisons 

One  of  the  fundamental  applications  of  the  data  struc¬ 
ture  accelerator  is  searching  tables  quickly.  The  con¬ 
tent  addressable  memory  accelerates  searches  involving 
boolean  patterns,  while  the  function  generator  acceler¬ 
ates  searches  involving  arithmetic  patterns.  For  exam¬ 
ple,  we  can  determine  those  lines  of  a  DSA  that  con¬ 
tain  a  number  lying  between  given  bounds,  or  the  line 
of  the  DSA  that  contains  the  largest  quantity  in  time 
independent  of  the  number  of  lines  being  searched. 

For  simplicity  we  assume  that  each  line  of  the  DSA 
contains  precisely  one  m-bit  key.  This  key  can  lie  either 
in  the  CAM  portion  of  the  DSA  or  in  the  RAM  portion. 
In  this  section  we  assume  the  key  lies  in  the  CAM,  but 
trivial modifications of the algorithms enable them to work
with  the  key  in  the  RAM.  We  consider  two  fundamen¬ 
tal  operations,  comparison  with  a  given  quantity  and 
finding  the  largest  element  of  a  set  of  keys. 


The following function identifies each line whose key
is greater than LowerBound. This is done by examining
each of the bits of the key in sequence. Each DSA line
can be in one of three states: greater, lesser and unknown.
In state lesser the entry is known to be less than
LowerBound; in state greater it is known to be greater
than LowerBound; and in state unknown we still don't know.
If an entry is in state unknown then its leading bits match
the leading bits of LowerBound that have been presented so
far.

All words are initially placed in the unknown state.
We compare the contents of the CAM with LowerBound
one bit at a time, changing the state of the line as
necessary. The state transition diagram is shown in
Figure 2. At the end of m cycles, any word still in
state unknown is equal to LowerBound.

The following program uses two special arrays.
BitMask[n] contains a 1 in the n-th bit position, from
highest to lowest. Thus LowerBound ∧ BitMask[n] selects
the n-th bit from LowerBound. FieldMask is similar,
but has the n highest bits set.

Compare(Array, LowerBound) {
    States Array.S ∈ {greater, lesser, unknown};
    ForEach i, i ∈ Array {
        Array.S_i ← unknown;
        for 0 ≤ n < MatchWidth {
            NewState S {
                unknown:
                    if Array.K_i = (LowerBound ∧ FieldMask[n])
                        then Array.S_i ← unknown;
                    else if 0 = LowerBound ∧ BitMask[n]
                        then Array.S_i ← greater;
                    else Array.S_i ← lesser;
                otherwise: Array.S_i ← Array.S_i;
            }
        }
    }
}
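The state machine above can be imitated on a conventional machine. The following Python sketch is only a sequential simulation under stated assumptions: the function name compare, the explicit width parameter, and the inner loop over lines (which stands in for the simultaneous update of all DSA lines) are all introduced here for illustration.

```python
def compare(keys, lower_bound, width):
    """Classify each key as 'greater', 'lesser' or 'unknown' (equal)
    relative to lower_bound, examining one bit per cycle, MSB first."""
    states = ["unknown"] * len(keys)
    for n in range(width):                 # one cycle per bit position
        bit = 1 << (width - 1 - n)
        bound_bit = lower_bound & bit
        # On the DSA every line would update at once; this loop simulates that.
        for i, key in enumerate(keys):
            if states[i] != "unknown":
                continue                   # decided lines keep their state
            if (key & bit) == bound_bit:
                continue                   # prefix still matches the bound
            # First differing bit: a 0 in the bound means the key has a 1 there.
            states[i] = "greater" if bound_bit == 0 else "lesser"
    return states


result = compare([5, 9, 12, 9, 0], lower_bound=9, width=4)
```

The number of outer iterations depends only on the key width, which mirrors the claim that the comparison time is independent of the number of lines.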

This and similar routines can be used by the compiler to implement
arithmetic predicates in ForEach statements. For instance, one might
want to use the DSA to represent a large set of one-dimensional
intervals. The following code segment would then be used to find those
intervals that contain the origin.

DSAstruct Sample {
    CAM L[16], R[16];
} S;

ForEach i, (S.L_i < 0) ∧ (S.R_i > 0)
    printf("Interval %d contains the origin.", i);


Figure  2:  State  Diagram  for  Comparison 


For example, we might want to use the DSA to represent a set of
intervals from 0 to 2^16 - 1. Each interval would be represented as one
line of a (32, n) DSA, with half of K_i used to hold the lower limit of
the interval and half for the upper limit. A modest sized data
structure accelerator could then contain a rather large set of these
intervals. Using this algorithm, we can determine the intervals in this
set that intersect a given interval in time linear in the size of the
intersection. This technique trivially extends to two or more
dimensions.
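The intersection query can be sketched as two comparisons per line followed by a read-out of the responding lines. The Python below is a sequential stand-in, assuming closed intervals stored as (lower, upper) pairs; the function name intersecting and the sample data are hypothetical.

```python
def intersecting(lines, q_lo, q_hi):
    """Return the indices of stored intervals that overlap [q_lo, q_hi]."""
    # Step 1: on the DSA this predicate would be evaluated on all lines at
    # once, using one comparison against q_hi and one against q_lo.
    marked = [lo <= q_hi and hi >= q_lo for lo, hi in lines]
    # Step 2: the marked lines are read out one at a time, so this part is
    # linear in the number of intersecting intervals.
    return [i for i, hit in enumerate(marked) if hit]


stored = [(0, 5), (10, 20), (4, 12), (30, 40)]
hits = intersecting(stored, 5, 10)
```

Two intervals overlap exactly when each begins no later than the other ends, which is why two threshold comparisons per line are enough.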

3.6  A  Space/Time  Tradeoff 

One unfortunate effect of the linear nearest neighbor interconnection
network of the data structure accelerator is that the (graph theoretic)
diameter of a data structure accelerator of n lines is n. Thus
computations that require data in distant lines to be combined can be
quite slow. The boolean n-cube type networks used by the Connection
Machine [3] have diameter log n and thus can perform somewhat better
with these algorithms. Unfortunately, boolean n-cube networks do not
scale to large numbers of processing elements. In fact, networks with
diameter less than n^(1/3) cannot be embedded in 3-space in a uniformly
scalable fashion. Power distribution and heat dissipation
considerations raise this bound to n^(1/2) and packaging considerations
to n. Thus algorithms that require a high degree of communication among
O(n) processing elements will require more than O(n) "real estate" in
realizable systems.

This section shows how to solve such a problem using O(n^2) lines of a
DSA. Notice that while fewer processors could be used if a smaller
diameter network were used, it is not clear that less total hardware
would be required.

Consider the following problem: Given a set of n integers
{a_0, ..., a_(n-1)}, find those pairs that have the minimum difference.
The brute force approach of comparing all pairs of integers requires
O(n^2) operations. However, this can be reduced to O(n log n)
operations by first sorting the integers and then comparing their
neighbors. Using a DSA we can solve this problem using O(n) space and
O(n) time in the following fashion. First, store each a_i in a line of
the DSA. Then, in parallel, compute the difference between the contents
of each line and a_0. Repeat this for each a_i, retaining the smallest
difference in each line. This will take O(n) operations. Finally, the
smallest difference is determined using the techniques of Section 3.5.
This requires O(1) operations. Thus this problem can be solved in O(n)
time by using O(n) lines of a DSA. The computational complexity of this
solution is still O(n^2) (O(n) processors and O(n) time), the same as
the simple algorithm, but we have achieved a modest speed-up
(O(log n)) over the sorting approach while continuing to use a
straightforward algorithm.
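The O(n)-line procedure can be simulated as n broadcast steps, each of which updates every line. In the Python sketch below the inner loop stands in for the parallel update, and the per-line partners list (which records the a_j that achieved the current minimum) is an assumption added here so that the pairs can be reported; the text only requires each line to retain its smallest difference. At least two input values are assumed.

```python
def min_difference_pairs(values):
    """Return (smallest difference, sorted list of index pairs achieving it)."""
    n = len(values)
    best = [None] * n                  # per-line register: smallest difference so far
    partners = [[] for _ in range(n)]  # per-line record of the broadcasts that gave it
    for j in range(n):                 # n sequential broadcast steps
        for i in range(n):             # conceptually simultaneous on all lines
            if i == j:
                continue               # a line ignores its own value
            d = abs(values[i] - values[j])
            if best[i] is None or d < best[i]:
                best[i], partners[i] = d, [j]
            elif d == best[i]:
                partners[i].append(j)
    smallest = min(best)               # global minimum search over the lines
    pairs = sorted({(min(i, j), max(i, j))
                    for i in range(n) if best[i] == smallest
                    for j in partners[i]})
    return smallest, pairs


answer = min_difference_pairs([10, 3, 7, 14, 6])
```

Only the outer loop corresponds to elapsed time on the DSA, which is where the O(n) bound comes from.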

An alternative approach is to store in each of n^2 lines of the DSA the
pairs (a_i, a_j). Then the n^2 differences can be computed in parallel
using O(1) operations. This may be a useful approach if many such
calculations with the same set of a_i are to be performed. The only
problem is to get the data into the DSA efficiently.

Observe that the n^2 entries in the DSA can be written using only O(n)
operations by using the selector carefully. Assume that n = 2^k. We
begin by writing a_i in the n lines beginning with line i·n. Then write
a_i in lines i, n + i, 2n + i and so on. Each of these two passes
requires O(n) write operations, so the entire n^2 array can be set up
in O(n) time.

The following code fragment implements this procedure for a 4 x 4
array. For simplicity, we have (unrealistically) assumed that each of
the a_i is a single bit. Notice that each line is a single DSA
instruction. Figure 3 illustrates the operation of this technique when
applied to the 2 x 2 case.

ForEach i, i ∈ "00XX"
    R_i[left] ← a_0;
ForEach i, i ∈ "01XX"
    R_i[left] ← a_1;
ForEach i, i ∈ "10XX"
    R_i[left] ← a_2;
ForEach i, i ∈ "11XX"
    R_i[left] ← a_3;
ForEach i, i ∈ "XX00"
    R_i[right] ← a_0;
ForEach i, i ∈ "XX01"
    R_i[right] ← a_1;
ForEach i, i ∈ "XX10"
    R_i[right] ← a_2;
ForEach i, i ∈ "XX11"
    R_i[right] ← a_3;

Figure 3: Intermediate states while creating a cross product
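The selector trick generalizes to any n = 2^k: each a_i is broadcast twice, once to the lines whose high k address bits equal i and once to the lines whose low k address bits equal i. The Python sketch below simulates this under stated assumptions; matches and load_cross_product are hypothetical helper names, n is assumed to be a power of two with n >= 2, and the loop over line indices stands in for the hardware selector.

```python
def matches(index, pattern):
    """True if the binary address of index matches pattern ('X' = don't care)."""
    bits = format(index, "0{}b".format(len(pattern)))
    return all(p in ("X", b) for p, b in zip(pattern, bits))


def load_cross_product(a):
    """Fill n*n lines with the pairs (a_i, a_j) using 2n broadcast writes."""
    n = len(a)                          # assumed to be a power of two, n >= 2
    k = n.bit_length() - 1              # n == 2**k
    lines = [[None, None] for _ in range(n * n)]
    writes = 0
    for i, value in enumerate(a):
        code = format(i, "0{}b".format(k))
        high = code + "X" * k           # e.g. "01XX": lines i*n .. i*n + n - 1
        low = "X" * k + code            # e.g. "XX01": lines i, n + i, 2n + i, ...
        for pattern, field in ((high, 0), (low, 1)):
            writes += 1                 # one broadcast write per selector pattern
            for idx in range(n * n):    # stands in for the parallel selector
                if matches(idx, pattern):
                    lines[idx][field] = value
    return lines, writes


a = ["a0", "a1", "a2", "a3"]
lines, writes = load_cross_product(a)
```

For the 4 x 4 case this issues the same eight writes as the fragment above, and line i·n + j ends up holding the pair (a_i, a_j).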

4  Problems  in  Computational  Geometry 

In this section we demonstrate how the data structure accelerator can
be used to solve several problems in computational geometry. Since
problems in computational geometry typically arise as a component of
other applications (such as VLSI or mechanical CAD), we think the use
of the data structure accelerator is particularly appropriate. It is
relatively inefficient to design specialized hardware to solve the
computational geometry problems that arise in CAD, because they are
just one component of a larger computational problem. And yet there is
a huge amount of potential parallelism to be exploited. The data
structure accelerator provides that parallelism in a fashion that is
not specialized to the problems of computational geometry. At the same
time, these techniques require using the data structure accelerator in
tandem with regular processing elements. This is precisely the type of
cooperative heterogeneous computation discussed in the introduction.

4.1  Convex  Polygon  Inclusion 

The convex polygon inclusion problem is relatively straightforward. A
convex polygon is described by the sequence of its vertices, as shown
in Figure 4(a). We are to determine if a given trial point is contained
within the polygon. The basic relationship we use is illustrated in
Figure 4(b). If we denote the x and y coordinates of the point P_i by
x_i and y_i, the signed area of the triangle P_1P_2P_3 is one half of
the determinant

\[
\begin{vmatrix} x_1 & y_1 & 1 \\ x_2 & y_2 & 1 \\ x_3 & y_3 & 1 \end{vmatrix}
= x_1 y_2 + x_2 y_3 + x_3 y_1 - x_1 y_3 - x_2 y_1 - x_3 y_2 .
\]

Only the sign of this quantity is used below, so the factor of one half
can be ignored.

The sign of the area is positive if the points P_1, P_2 and P_3 are
arranged counterclockwise in the plane [5]. Thus in Figure 4(b), the
triangle P_1P_2P_3 has positive area, while the triangle P_1P_3P_2 has
negative area.
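The orientation test is easy to check numerically. The short Python sketch below evaluates the 3 x 3 determinant with rows (x_i, y_i, 1), which equals twice the geometric signed area; the function name signed_area2 and the sample points are chosen here for illustration.

```python
def signed_area2(p1, p2, p3):
    """Twice the signed area of triangle p1 p2 p3 (the 3x3 determinant)."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    return x1 * y2 + x2 * y3 + x3 * y1 - x1 * y3 - x2 * y1 - x3 * y2


ccw = signed_area2((0, 0), (1, 0), (0, 1))   # counterclockwise order
cw = signed_area2((0, 0), (0, 1), (1, 0))    # same points, clockwise order
flat = signed_area2((0, 0), (1, 1), (2, 2))  # collinear points
```

Exchanging any two vertices flips the sign, and collinear points give zero.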

To  determine  if  a  trial  point  P  is  contained  within  a 
polygon  we  check  that  the  triangles  formed  by  P  and 
each  edge  of  the  polygon  have  positive  area.  This  is 
easily  done  by  assigning  each  edge  of  the  polygon  to  a 
line  of  the  DSA: 

DSAarray LineSegment {
    RAM Lx[ℓ], Ly[ℓ], Rx[ℓ], Ry[ℓ],
        Area[2ℓ];
};

We have allocated four ℓ-bit quantities to hold the coordinates of the
endpoints, and one 2ℓ-bit quantity for the area of the triangle in the
computation.

PolygonInclusion(Edges, Px, Py)
DSAarray LineSegment Edges[]; {
    ForEach i, i ∈ Edges {
        Edges.Area_i
            ← Edges.Lx_i × Edges.Ry_i + Edges.Rx_i × Py
              + Px × Edges.Ly_i - Edges.Lx_i × Py
              - Edges.Rx_i × Edges.Ly_i - Px × Edges.Ry_i;
    }
    if ∃j, (Edges.Area_j < 0)
        then return("Outside");
        else return("Inside");
}
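A sequential rendering of the same test may make the data flow clearer. This Python sketch assumes the polygon's vertices are supplied in counterclockwise order and that edge k runs from vertex k to vertex k + 1 (wrapping around); the function name and the list comprehension, which stands in for the parallel ForEach, are illustrative choices.

```python
def polygon_inclusion(vertices, px, py):
    """Report whether (px, py) lies inside a convex, counterclockwise polygon."""
    n = len(vertices)
    # One (left endpoint, right endpoint) pair per edge, as in one DSA line each.
    edges = [(vertices[k], vertices[(k + 1) % n]) for k in range(n)]
    # On the DSA all areas would be computed simultaneously.
    areas = [lx * ry + rx * py + px * ly - lx * py - rx * ly - px * ry
             for (lx, ly), (rx, ry) in edges]
    # A single negative area means the point is to the right of some edge.
    return "Outside" if any(area < 0 for area in areas) else "Inside"


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
inside = polygon_inclusion(square, 2, 2)
outside = polygon_inclusion(square, 5, 2)
on_edge = polygon_inclusion(square, 4, 2)
```

As in the pseudocode, a point on the boundary yields a zero area for some edge and is therefore reported as "Inside".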

The time required by this algorithm is independent of the number of
edges of the polygon(s) but, due to the multiplications in the area
computation, quadratic in the number of bits required to represent the
coordinates of the vertices, O(ℓ^2). We could say that the time is
O(ℓ log ℓ) by using an FFT algorithm, but this would only be of
theoretical interest and ignores the performance cost of getting a
larger algorithm to the DSA. In addition, we must count the
preprocessing time required to load the n vertices into the DSA, which
is O(ℓn), since each point has size O(ℓ). The time to check m trial
points grows to O(mℓ^2) (using classical multiplication), while the
preprocessing time remains fixed.

If the number of trial points is large, or the same trial points are to
be used repeatedly with different sets of polygons, we can store the
trial points in the DSA and perform the computation for different
polygons. In this case the preprocessing time becomes O(ℓm) while the
computing time becomes O(nℓ^2).

Figure 4: Polygon Inclusion

Classical algorithms [5] require O(ℓn) time for preprocessing and
answer the inclusion question for m trial points in time
O(ℓ^2 m log n). For comparison, these results are summarized below.


                        Preprocessing    Query
vertices in DSA         ℓn               ℓ^2 m
trial points in DSA     ℓm               ℓ^2 n
classical               ℓn               ℓ^2 m log n

The  DSA  approach  uses  a  straightforward  algorithm 
and  achieves  somewhat  better  performance  than  the 
classical  techniques.  The  following  section  discusses  a 
variant  of  this  problem  where  the  test  points  lie  in  a 
regular  grid. 

4.2  Polygon  Filling 

A common primitive for a number of geometric algorithms is polygon
filling. In the two dimensional case, we are given a rectangular
section of the plane discretized into an m x n grid. Within this grid
are marked the boundaries of a number of connected regions. The polygon
filling problem is to identify the regions in which each point of the
grid is contained.

A simple example of this problem arises in computer graphics where the
regions might represent homogeneous regions of an image, e.g. surfaces
of objects. When the image is presented on the screen, each pixel
within each region needs to be painted with the same color.

Figure 5 shows a two dimensional region embedded in a grid. Each dot
represents a single processing element of a two dimensional DSA. The
diagonal lines form the true boundary of the region, while the black
dots indicate the boundary on the grid. Notice that each black dot lies
on or within the boundary of the region. Given such a boundary we can
propagate a seed node outward until it reaches a boundary. This is
illustrated in Figure 5 where the seed node is at (6,5). At t_0 it is
the only node marked. Between t_i and t_(i+1) each marked node
propagates a mark to each of its four neighbors if they are (1) not
already marked and (2) not an element of the boundary. The time at
which each node is marked is given in the figure. In this case every
node in the region that is orthogonally connected to (6,5) is marked in
7 units of time.

The  data  structure  used  to  model  the  grid  is  defined 
as  follows. 

DSAstruct FillGrid {
    States S ∈ {interior, exterior, boundary, unknown}
} Grid[n, n];


Each node in the grid can be in one of four states: interior, exterior,
boundary and unknown. Initially each node is placed in the unknown
state. The boundary is then defined by placing each node on or just
inside the boundary in the boundary state. The orientation of the
boundary is defined by setting one node inside the region to the
interior state. This node serves as a seed that spreads throughout the
region.

The  following  block  of  code  then  propagates  the  seed 
throughout  the  interior. 

Propagate(Grid) {
    for t < n {
        ForEach (i,j), (i,j) ∈ Grid {
            NewState Grid.S {
                unknown:
                    if (Grid.S_(i+1,j) = interior)
                       ∨ (Grid.S_(i,j+1) = interior)
                       ∨ (Grid.S_(i-1,j) = interior)
                       ∨ (Grid.S_(i,j-1) = interior)
                        then Grid.S_(i,j) ← interior;
                    else Grid.S_(i,j) ← unknown;
                otherwise: Grid.S_(i,j) ← Grid.S_(i,j);
            }
        }
    }
}

Figure 5: Sample Image
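The propagation rule can be tried out on a small grid. The Python sketch below is a sequential simulation: every node is updated from a snapshot of the previous step, which imitates the synchronous SIMD update. The helper parse, the single-letter grid encoding and the 7 x 7 sample are assumptions made for this example.

```python
def propagate(grid, steps):
    """Spread 'interior' into 4-connected 'unknown' nodes for a fixed number of steps."""
    rows, cols = len(grid), len(grid[0])
    for _ in range(steps):
        nxt = [row[:] for row in grid]       # all nodes read the same old state
        for i in range(rows):
            for j in range(cols):
                if grid[i][j] != "unknown":
                    continue                 # "otherwise" case: state unchanged
                neighbours = ((i + 1, j), (i, j + 1), (i - 1, j), (i, j - 1))
                if any(0 <= a < rows and 0 <= b < cols and grid[a][b] == "interior"
                       for a, b in neighbours):
                    nxt[i][j] = "interior"
        grid = nxt
    return grid


def parse(rows):
    """Hypothetical helper: B = boundary, . = unknown, I = interior seed."""
    names = {"B": "boundary", ".": "unknown", "I": "interior"}
    return [[names[c] for c in row] for row in rows]


grid = parse([
    ".......",
    ".BBBBB.",
    ".B...B.",
    ".B.I.B.",
    ".B...B.",
    ".BBBBB.",
    ".......",
])
filled = propagate(grid, steps=len(grid))
```

The seed fills the 3 x 3 interior and stops at the boundary ring, so nodes outside the ring remain unknown.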

This routine assumes the orthogonal distance between any two connected
nodes is no more than n. This is the case for convex polygons, but
serpentine (concave) polygons can be created whose interiors have
minimal orthogonal paths of length O(n^2).
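One way to cope with serpentine regions, offered here only as a possible variant and not as part of the routine above, is to keep propagating until no node changes state. On a DSA this would presumably need a global "any line changed" test after each step; the Python sketch below simply tracks the wavefront. The function name, the string encoding of the grid and the 5 x 5 corridor are assumptions for illustration.

```python
def fill_until_stable(grid):
    """Spread 'I' through 4-connected '.' cells until nothing changes.

    Returns the filled grid and the number of propagation steps used."""
    cells = [list(row) for row in grid]
    rows, cols = len(cells), len(cells[0])
    frontier = [(i, j) for i in range(rows) for j in range(cols)
                if cells[i][j] == "I"]
    steps = 0
    while frontier:
        newly = []                       # nodes marked during this step
        for i, j in frontier:
            for a, b in ((i + 1, j), (i, j + 1), (i - 1, j), (i, j - 1)):
                if 0 <= a < rows and 0 <= b < cols and cells[a][b] == ".":
                    cells[a][b] = "I"
                    newly.append((a, b))
        if newly:
            steps += 1
        frontier = newly                 # only fresh nodes can spread further
    return ["".join(row) for row in cells], steps


snake = [
    "I....",
    "BBBB.",
    ".....",
    ".BBBB",
    ".....",
]
filled, steps = fill_until_stable(snake)
```

In this corridor the far corner is 16 orthogonal steps from the seed although the grid is only 5 nodes wide, so a loop bounded by n would stop too early.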

5  Conclusions 

This paper has discussed a SIMD architecture that is designed to be
used as an integral component of a heterogeneous highly parallel
machine. We have described a few basic algorithms for using the data
structure accelerator and provided a language for describing other
algorithms. The major observation we make is not that the DSA yields
enormous speed-ups over optimal sequential algorithms, but rather that
we get a modest speed-up over the exceedingly complex, optimal
algorithms by using straightforward algorithms on the DSA. Thus we
claim to have improved the complexity-speed/cost product.

This paper benefited from the comments of Laurie Hendren, James Stewart
and Steve Vavasis. Paul Chew has corrected a number of oversights and
generally improved the content and presentation of this paper.